Novel Algorithms for Computing Medians and Other Quantiles of Disk-Resident Data

Authors

  • Lixin Fu
  • Sanguthevar Rajasekaran
Abstract

In data warehousing applications, numerous OLAP queries involve holistic operations such as computing the "top N", the median, and so on. Efficient implementations of these operations are hard to come by. Several algorithms have been proposed in the literature that estimate various quantiles of disk-resident data. Two such recent algorithms are based on sampling. In this paper we present two novel and efficient quantiling algorithms, Deterministic Bucketing (DB) and Randomized Bucketing (RB). We have analyzed the performance of DB and RB and extended the analysis of the sampling done in prior algorithms. We have conducted extensive experiments to compare all four algorithms. Our experimental data indicate that the new algorithms outperform the prior algorithms not only in overall run time but also in accuracy. The new algorithms can be used either as one-pass algorithms to accurately estimate quantiles or as algorithms for computing the quantiles exactly.
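The abstract does not describe how DB or RB work, so the following is only a generic illustration of the bucketing idea it alludes to: one sequential pass over the data gives an estimate, and a second pass gives the exact value. This is a minimal Python sketch under stated assumptions, not the authors' algorithms. The function names, the equal-width buckets, the known bounds `lo` and `hi`, the `read_pass` callable standing in for one disk scan, and the rank convention `ceil(phi * n)` are all hypothetical choices made for the example.

```python
import math


def bucket_counts(values, lo, hi, num_buckets):
    """One pass: count the values in each equal-width bucket of [lo, hi].

    Assumes every value lies in [lo, hi] (an assumption of this sketch).
    """
    counts = [0] * num_buckets
    width = (hi - lo) / num_buckets
    for v in values:
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return counts


def locate_bucket(counts, rank):
    """Return (bucket index, count of values in earlier buckets) for a 1-based rank."""
    below = 0
    for i, c in enumerate(counts):
        if below + c >= rank:
            return i, below
        below += c
    raise ValueError("rank exceeds the number of values")


def estimate_quantile(values, phi, lo, hi, num_buckets=100):
    """One-pass estimate: midpoint of the bucket that holds the target rank."""
    counts = bucket_counts(values, lo, hi, num_buckets)
    rank = max(1, math.ceil(phi * sum(counts)))
    i, _ = locate_bucket(counts, rank)
    width = (hi - lo) / num_buckets
    return lo + (i + 0.5) * width


def exact_quantile(read_pass, phi, lo, hi, num_buckets=100):
    """Two passes: find the target bucket, then sort only that bucket's values.

    read_pass is a hypothetical zero-argument callable that returns a fresh
    iterator over the data, standing in for one sequential scan of the disk.
    """
    counts = bucket_counts(read_pass(), lo, hi, num_buckets)
    rank = max(1, math.ceil(phi * sum(counts)))
    i, below = locate_bucket(counts, rank)
    width = (hi - lo) / num_buckets
    in_bucket = sorted(
        v for v in read_pass()
        if min(int((v - lo) / width), num_buckets - 1) == i
    )
    return in_bucket[rank - below - 1]
```

In this sketch the one-pass estimate is off by at most one bucket width, and the second pass only needs memory for the values of a single bucket, which is why skewed data would call for better bucket boundaries than the equal-width ones assumed here.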


Similar resources

A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data

The φ-quantile of an ordered sequence of data values is the element with rank φn, where n is the total number of values. Accurate estimates of quantiles are required for the solution of many practical problems. In this paper, we present a new algorithm for estimating the quantile values for disk-resident data. Our algorithm has the following characteristics: (1) It requires only one pass over ...
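The rank-based definition of a quantile quoted above can be made concrete with a small in-memory example. This Python sketch is illustrative only and is not the one-pass algorithm from that paper. The function name `quantile_by_rank` is hypothetical, and rounding the rank up with `ceil` is an assumed convention, since the excerpt does not say how a fractional rank is handled.

```python
import math


def quantile_by_rank(values, phi):
    """Return the element whose 1-based rank in sorted order is ceil(phi * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(phi * len(ordered)))  # assumed rounding convention
    return ordered[rank - 1]
```

For example, under this convention the median (`phi = 0.5`) of the four values 7, 1, 5, 3 is the element of rank 2 in sorted order, which is 3.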


Efficient Data Mining with Evolutionary Algorithms for Cloud Computing Application

With the rapid development of the internet, the amount of information and data being produced has become massive. Clients can therefore be overwhelmed by the volume of data and find it difficult to tell which parts are useful. Data mining can address this problem. When data mining is run on cloud computing platforms, it reduces processing time, energy usage, and cost. As the speed of ...


[7] A. Asuncion and D. J. Newman. UCI Machine Learning Repository

[3] Rakesh Agrawal and Arun Swami. A one-pass space-efficient algorithm for finding quantiles. A one-pass algorithm for accurately estimating quantiles for disk-resident data. [8] Jürgen Beringer and Eyke Hüllermeier. An efficient algorithm for instance-based learning on data streams.


Assessment Methodology for Anomaly-Based Intrusion Detection in Cloud Computing

Cloud computing has become an attractive target for attackers because the mainstream technologies in the cloud, such as virtualization and multitenancy, permit multiple users to utilize the same physical resource, thereby posing the so-called problem of internal-facing security. Moreover, traditional network-based intrusion detection systems (IDSs) are ineffective when deployed in the cloud...


INTERVAL ANALYSIS-BASED HYPERBOX GRANULAR COMPUTING CLASSIFICATION ALGORITHMS

Granular computing research focuses mainly on the representation of a granule and on the relations and operations between two granules. Hyperbox granular computing classification algorithms (HBGrC) are proposed based on interval analysis. Firstly, a granule is represented as a hyperbox, which is the Cartesian product of $N$ intervals, for classification in the $N$-dimensional space. Secondly, the relation betwee...



Publication date: 2001